CMU 11642 的课程笔记。这一篇讲两种 document representation 方法，Controlled vocabulary index terms vs Free-text or full-text index terms

Overview

Controlled vocabulary index terms

从一个 well-defined classification scheme 中挑取 term，比较有名的开放分类目录有 dmoz。

Structure

broad vocabularies 来描述概括性的 topic; detailed vocabularies 来描述更加细节的 topic。一个 well-defined classification scheme 主要有以下构成：

用于识别文档主题的一组规则
指定的词库
一组索引的术语(term)
用于分配索引项的一组规则

Advantages and Disadvantages

优点：

高的召回率
支持浏览和搜索
在一些领域非常流行（如医学，法律，专利等）

缺点：

coverage vs detail tradeoff
人工创建和维护的成本很高
人们难以一致地分配文件
检索受限制

Free-text or full-text index terms

从原文档或者相关文档中挑取 term。Free-text or full-text indexing 用的是 uncontrolled vocabulary。Free-text 和 full-text indexing 的区别在于前者只用了部分的 term 作为 index，而后者用了几乎所有的 term 来作为 index。

How to select terms?

selected terms 人工选择
all terms 就不用考虑选择的问题

Advantages and Disadvantages

优点：

索引词汇保证与文档的内容有很好的匹配
无需学习（可能会很复杂的）受控词表
可能比控制词汇更容易自动化

缺点：

更可能会导致词汇比匹配
比如文档里有 automobile，query 说是 car，就不能 match

Process

Search engine uses shallow language analysis and heuristics to convert lexical tokens (usually words) into index terms (features)

Heuristic methods: map tokens to indexing terms

Stopwords

一些 stopwords 如 the, a 并没有实际意义，删除 stopwords 可以减小 index size，提高准确性和效率，然而也会带来一些问题，如无法处理一些 query(eg. To be or not to be, let it be)。解决方案是我们把 index 的 stopwords 存下来，在处理 query 的时候去掉 query 里的 stopwords，如果 stopwords 在 query terms 里占比很高，或者用户明确要求留下 stopwords (eg. +the last)，就把 stopwords 留下。

优点：

丢掉不具有内容信息的词
大幅减少索引大小，减少检索时间
提高准确性

缺点：

难以满足某些特殊的 query (eg. To be or not to be, let it be)

创建 stopword list
通过 frequency analysis 和 manual review 来完成。

基于频率对字典进行排序
检查最常用的 term
检查查询日志，查看哪些频繁的 term 可能很重要

Normalization

通常我们需要对 token 进行规范化，比如大小写转换，以便下一步处理。

优点:

提高召回率，匹配更多查询

缺点：

如 Apple 可以用作公司名称，而 apple 将被视为一种水果。

Morphological analysis

其实是一种映射。Map a token to another token (“stemming”,”conflation”) eg. images -> image
常用的 stemming algorithms 有 Porter, KSTEM 等，一般来说，Porter 和 KSTEM 能产生的差不多准确的 search results。Porter 更加的 aggressive，可能会出现一些不是词的词，而 KSTEM 更加的保守，很少会产生 smaller conflation classes，更加像”词”。
对于 企业检索 而言，corpus 相对较小，recall 通常很重要，所以用户为了得到更多的相关文档，对 stemming mistakes 容忍度较高。而对于 网页检索 而言，corpus 很大，recall 并没有那么重要，precision 更重要，所以对 stemming/lemmatization mistakes 容忍度更低，所以并不使用。Google 之前是不做 stemming 的，现在似乎开始做了。

这些技术都是因语言而异的，不同的语言有不同的语法规则，不能一概而论。

优点:

Conflating variations of a word
更准确地表示文档
匹配更广泛的查询

缺点：

效果不一致，Stemming 的结果可能不是词语
term 可能被错误地分组，不相关的词可能具有相同的 stem（例如，Apple,apple）
复杂的 morphological analysis 可能非常缓慢

Phrases

对 phrase 的处理，一般有两种方案。

一种是 precoordinate(one inverted list)，把词组存为 index，比如 interest rate，inverted list 存成 interest_rate，在用户查询时 interest rate 时，替换成 interested_rate 进行 match。这种做法耗费了很多空间，怎样选择要存储的词组也是个问题，事实上可能会存很多永远不会被查询的词组。
另一种是 postcoordinate(more than one inverted lists)，对 query 进行 reformulation, 如 interest rate 变成 #NEAR/1(interest rate)，然后进行 match。这种方法查询时会有些慢，然而不必纠结于词组的选择。

De-compounding

computer-virus -> computer,virus

优点

更准确地表示文档
匹配更广泛的查询

缺点

N-grams 像 “roe v. wade” 会变得没有意义

其它

Basic lexical processing

tokens
stopwords
morphologial processing (“stemming”)

Other representations

phrases, citations and inlink text, paths and urls

Multiple representations